      [2]:
import pandas as pd
import matplotlib.pyplot as plt
      [3]:
      iris = pd.read_csv(r'C:\Users\user\Desktop\iris (3).csv')
      [4]:
      print(iris.shape)
      (150, 6)
      
      [5]:
      iris.head()
      [5]:
   id  sepal_len  sepal_wd  petal_len  petal_wd      species
0   0        5.1       3.5        1.4       0.2  iris-setosa
1   1        4.9       3.0        1.4       0.2  iris-setosa
2   2        4.7       3.2        1.3       0.2  iris-setosa
3   3        4.6       3.1        1.5       0.2  iris-setosa
4   4        5.0       3.6        1.4       0.2  iris-setosa
      [6]:
      iris.drop('id', axis = 1, inplace = True)
      [7]:
      iris.head()
      [7]:
   sepal_len  sepal_wd  petal_len  petal_wd      species
0        5.1       3.5        1.4       0.2  iris-setosa
1        4.9       3.0        1.4       0.2  iris-setosa
2        4.7       3.2        1.3       0.2  iris-setosa
3        4.6       3.1        1.5       0.2  iris-setosa
4        5.0       3.6        1.4       0.2  iris-setosa
      [8]:
#Summary statistics
iris.describe()
      [8]:
        sepal_len    sepal_wd   petal_len    petal_wd
count  150.000000  150.000000  150.000000  150.000000
mean     5.843333    3.057333    3.758000    1.199333
std      0.828066    0.435866    1.765298    0.762238
min      4.300000    2.000000    1.000000    0.100000
25%      5.100000    2.800000    1.600000    0.300000
50%      5.800000    3.000000    4.350000    1.300000
75%      6.400000    3.300000    5.100000    1.800000
max      7.900000    4.400000    6.900000    2.500000
      [9]:
missing_values = iris.isnull().sum()
print(missing_values)
      sepal_len    0
      sepal_wd     0
      petal_len    0
      petal_wd     0
      species      0
      dtype: int64
      
      [10]:
#Check the data types of the dataset
print(iris.dtypes)
      sepal_len    float64
      sepal_wd     float64
      petal_len    float64
      petal_wd     float64
      species       object
      dtype: object
      
      [11]:
#No missing values, and all the features in the dataset are numeric, so we conclude that the dataset is clean
      [12]:
class_distr = iris['species'].value_counts()
print(class_distr)
      iris-setosa        50
      iris-versicolor    50
      iris-virginica     50
      Name: species, dtype: int64
      
      [13]:
      import matplotlib.pyplot as plt
      [14]:
#This gives us a much clearer idea of the distribution of the input variables, showing that both sepal length and sepal width have a normal (Gaussian) distribution.
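The distribution comment above presumably refers to a histogram plot whose code did not survive the export. A minimal sketch of how those histograms could be produced (the tiny `iris` frame here is a stand-in for the full dataset loaded earlier):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Stand-in for the iris DataFrame loaded earlier in the notebook.
iris = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_wd":  [3.5, 3.0, 3.2, 3.1, 3.6],
})

# One histogram per numeric column, as the comment describes.
axes = iris.hist(figsize=(8, 6))
plt.savefig("iris_hist.png")
```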
      [15]:
      #check correlation between the features
      [16]:
#From the above visualization it is difficult to separate "iris-versicolor" from "iris-virginica" because of the overlap in petal length & width and also sepal length & width.
      [17]:
      pd.plotting.scatter_matrix(iris)
      [18]:
#correlation_matrix = iris[['sepal_wd', 'sepal_len', 'petal_wd', 'petal_len']].corr()
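The commented-out correlation line needs double brackets (a list of column names) to select a sub-DataFrame. A runnable sketch on a small stand-in frame (the values are illustrative, not the real iris data):

```python
import pandas as pd

# Hypothetical stand-in for the iris frame; only numeric columns are needed.
iris = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_wd":  [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_len": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_wd":  [0.2, 0.2, 0.2, 0.4, 0.3],
})

# Double brackets select a sub-DataFrame; single brackets with several names raise a KeyError.
correlation_matrix = iris[["sepal_wd", "sepal_len", "petal_wd", "petal_len"]].corr()
print(correlation_matrix)
```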
      [19]:
#We identify that petal length and petal width are the most useful features to separate the species.
      [32]:
      #Data preparation
      [40]:
#Stratified sampling ensures that the training and test sets have approximately the same percentage of samples of each target class as the complete set.
      [44]:
from sklearn.model_selection import train_test_split

#Define the feature matrix X and target Y (assumed from context), then make a stratified 70/30 split
X = iris.drop('species', axis = 1)
Y = iris['species']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1, stratify = Y)
      [45]:
print(Y_train.value_counts())
print(Y_test.value_counts())
      iris-setosa        35
      iris-virginica     35
      iris-versicolor    35
      Name: species, dtype: int64
      iris-virginica     15
      iris-setosa        15
      iris-versicolor    15
      Name: species, dtype: int64
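The counts above show stratification at work: each class keeps its 1/3 share in both splits. A self-contained sketch with synthetic labels (mirroring the 50-per-class iris balance) that reproduces the same 35/15 per-class split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 50 labels per species, like the iris dataset.
Y = pd.Series(["iris-setosa"] * 50 + ["iris-versicolor"] * 50 + ["iris-virginica"] * 50)
X = pd.DataFrame({"feature": range(150)})

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)

# stratify=Y preserves the class balance: 35 of each class in train, 15 in test.
print(Y_train.value_counts())
print(Y_test.value_counts())
```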
      
      [46]:
      #MODELLING
      [47]:
      from sklearn.neighbors import KNeighborsClassifier
      [48]:
#Now create an instance, knn, of the class KNeighborsClassifier
      [53]:
      knn = KNeighborsClassifier(n_neighbors = 5)
      [54]:
#Note that the only parameter we need to set for this problem is n_neighbors, i.e. the k in k-NN.
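The notebook fixes k = 5 by hand; in practice k is usually tuned. A sketch (not from the original notebook) of scoring a few candidate k values with 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean cross-validated accuracy for each candidate k.
scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```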
      [55]:
      #Use the data X_train and Y_train to train the model
      [56]:
      #FITTING
      [59]:
      print(knn.fit(X_train, Y_train))
      KNeighborsClassifier()
      
      [60]:
#We use mostly the default values for the parameters, e.g., metric = 'minkowski'
      [61]:
#sklearn.neighbors.KNeighborsClassifier(n_neighbors = 5, weights = 'uniform', algorithm = 'auto', leaf_size = 30, p = 2, metric = 'minkowski', metric_params = None, n_jobs = None)
      [62]:
      #LABEL PREDICTION
      [63]:
#To make a prediction in scikit-learn, we call the method predict(). We are trying to predict the species of iris.
      [64]:
      #Let's make the prediction on the test data set and save the output in pred for later review
      [65]:
      pred = knn.predict(X_test)
      [66]:
#Let's review the first prediction
      [71]:
      print(pred[:1])
      ['iris-virginica']
      
      [74]:
      #PROBABILITY PREDICTION
      [75]:
#All classification algorithms implemented in scikit-learn provide an additional method, predict_proba().
      [84]:
y_pred_prob = knn.predict_proba(X_test)
print(y_pred_prob[:5])
      [[0. 0. 1.]
       [1. 0. 0.]
       [1. 0. 0.]
       [0. 1. 0.]
       [0. 1. 0.]]
      
      [82]:
      print(pred[:5])
      ['iris-virginica' 'iris-setosa' 'iris-setosa' 'iris-versicolor'
       'iris-versicolor']
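Each column of predict_proba() corresponds to one entry of the classifier's classes_ attribute, so taking the row-wise argmax recovers the labels that predict() returns. A self-contained sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
proba = knn.predict_proba(X_test)

# Column i of proba corresponds to knn.classes_[i]; argmax per row
# recovers exactly what predict() returns.
recovered = knn.classes_[np.argmax(proba, axis=1)]
print((recovered == knn.predict(X_test)).all())
```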
      
      [ ]:
#The 1st is predicted to be iris-virginica, the next two iris-setosa, and the last two iris-versicolor.
      [4]:
      #Step 1: Import Libraries
      [2]:
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      from statsmodels.tsa.stattools import adfuller
      from statsmodels.tsa.seasonal import STL
      from statsmodels.tsa.arima.model import ARIMA
      from statsmodels.tsa.holtwinters import ExponentialSmoothing
      [5]:
      #Step 2: Load and Visualize Data
      #Load your time series data into a pandas DataFrame. Ensure that the time column is in the proper datetime format. Use matplotlib to visualize the data.
      [11]:
      # Load your time series data
      data = pd.read_csv(r"C:\Users\user\Desktop\AirPassengers.csv")
      ​
      # Convert the date column to a datetime object (if not already)
      data['Month'] = pd.to_datetime(data['Month'])
      ​
      # Set the date column as the DataFrame index
      data.set_index('Month', inplace=True)
      ​
      # Visualize the time series data
      plt.figure(figsize=(10, 6))
      plt.plot(data.index, data['Passengers'], label='Time Series Data')
      plt.xlabel('Date')
      plt.ylabel('Passengers')
      plt.title('Time Series Data')
      plt.legend()
      plt.show()
      [22]:
      print(data)
                  Passengers
      Month                 
      1949-01-01         112
      1949-02-01         118
      1949-03-01         132
      1949-04-01         129
      1949-05-01         121
      ...                ...
      1960-08-01         606
      1960-09-01         508
      1960-10-01         461
      1960-11-01         390
      1960-12-01         432
      
      [144 rows x 1 columns]
      
      [23]:
       
      #Step 3: Check for Stationarity
      #Stationarity is a crucial assumption in many time series models. We can use the Augmented Dickey-Fuller test to check for stationarity.
      [24]:
      def check_stationarity(series):
          result = adfuller(series)
          print('ADF Statistic:', result[0])
          print('p-value:', result[1])
          print('Critical Values:')
          for key, value in result[4].items():
              print(f'  {key}: {value}')
      ​
      check_stationarity(data['Passengers'])
      ADF Statistic: 0.8153688792060497
      p-value: 0.991880243437641
      Critical Values:
        1%: -3.4816817173418295
        5%: -2.8840418343195267
        10%: -2.578770059171598
      
      [25]:
#If the p-value is less than the significance level (e.g., 0.05), we reject the null hypothesis, indicating that the data is stationary. Here the p-value is about 0.99, so we fail to reject the null: the series is non-stationary.
      [26]:
      #Step 4: Detrending
      [27]:
      #If your data has a clear trend, you can detrend it to make it stationary. One way to do this is by differencing the series.
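First differencing removes a deterministic linear trend exactly: if y_t = 2t + 5, then y_t - y_{t-1} = 2 for every t. A tiny sketch of this on a synthetic series:

```python
import numpy as np
import pandas as pd

# Synthetic series with a pure linear trend: y_t = 2t + 5.
s = pd.Series(np.arange(10) * 2.0 + 5.0)

# First difference, written the same way as in the notebook (equivalent to s.diff()).
detrended = (s - s.shift(1)).dropna()

# After differencing, only the constant slope remains.
print(detrended.tolist())
```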
      [28]:
      data['Detrended'] = data['Passengers'] - data['Passengers'].shift(1)
      data.dropna(inplace=True)
      ​
      plt.figure(figsize=(10, 6))
      plt.plot(data.index, data['Detrended'], label='Detrended Data')
      plt.xlabel('Date')
      plt.ylabel('Detrended Value')
      plt.title('Detrended Time Series Data')
      plt.legend()
      plt.show()
      [29]:
      #Step 5: Seasonality Analysis
      [30]:
#To identify and remove seasonality, we can use seasonal-trend decomposition using LOESS (STL).
      [31]:
      seasonal_decomp = STL(data['Passengers'], seasonal=13)
      result = seasonal_decomp.fit()
      ​
      plt.figure(figsize=(10, 8))
      plt.subplot(4, 1, 1)
      plt.plot(data.index, result.trend, label='Trend')
      plt.xlabel('Date')
      plt.ylabel('Trend')
      plt.legend()
      ​
      plt.subplot(4, 1, 2)
      plt.plot(data.index, result.seasonal, label='Seasonal')
      plt.xlabel('Date')
      plt.ylabel('Seasonal')
      plt.legend()
      ​
      plt.subplot(4, 1, 3)
      plt.plot(data.index, result.resid, label='Residuals')
      plt.xlabel('Date')
      plt.ylabel('Residuals')
      plt.legend()
      ​
      plt.subplot(4, 1, 4)
      plt.plot(data.index, result.observed, label='Original')
      plt.xlabel('Date')
      plt.ylabel('Original')
      plt.legend()
      ​
      plt.tight_layout()
      plt.show()
      [32]:
      #Step 6: Smoothing
      [33]:
      #Smoothing techniques like moving averages or exponential smoothing can help remove noise and emphasize patterns.
      [36]:
      data['Smoothed'] = data['Passengers'].rolling(window=7).mean()  # Moving average with window size 7
      ​
      plt.figure(figsize=(10, 6))
      plt.plot(data.index, data['Passengers'], label='Original Data')
      plt.plot(data.index, data['Smoothed'], label='Smoothed Data')
      plt.xlabel('Date')
      plt.ylabel('Passengers')
      plt.title('Original and Smoothed Time Series Data')
      plt.legend()
      plt.show()
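The comment above also mentions exponential smoothing. Besides statsmodels' ExponentialSmoothing (imported in Step 1 but not used), pandas' ewm offers a lightweight version: each smoothed value is a weighted average where older observations decay geometrically. A sketch on a stand-in series (the first few AirPassengers values):

```python
import pandas as pd

# Stand-in series: the first five monthly passenger counts from the dataset.
passengers = pd.Series([112, 118, 132, 129, 121], dtype=float)

# Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
# alpha controls how quickly old observations are forgotten.
smoothed = passengers.ewm(alpha=0.5, adjust=False).mean()
print(smoothed.tolist())
```

With alpha=0.5 each step averages the new observation with the previous smoothed value, so the output follows the data but with the jumps damped.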
      [37]:
      #Step 7: Autocorrelation and Partial Autocorrelation
      [39]:
      from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
      ​
      plt.figure(figsize=(12, 4))
      plt.subplot(1, 2, 1)
      plot_acf(data['Detrended'], ax=plt.gca(), lags=20)
      ​
      plt.subplot(1, 2, 2)
      plot_pacf(data['Detrended'], ax=plt.gca(), lags=20)
      ​
      plt.tight_layout()
      plt.show()
      [40]:
      #Step 8: Choose and Fit a Model
      [41]:
      #Based on the autocorrelation and partial autocorrelation plots, select the orders for the ARIMA model.
      [43]:
      # Assume ARIMA(1, 0, 1) as an example
      model = ARIMA(data['Passengers'], order=(1, 0, 1))
      result = model.fit()
      ​
      # Print the model summary
      print(result.summary())
      C:\ProgramData\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
        self._init_dates(dates, freq)
      
                                     SARIMAX Results                                
      ==============================================================================
      Dep. Variable:             Passengers   No. Observations:                  143
      Model:                 ARIMA(1, 0, 1)   Log Likelihood                -696.484
      Date:                Sat, 22 Jul 2023   AIC                           1400.967
      Time:                        23:18:07   BIC                           1412.819
      Sample:                    02-01-1949   HQIC                          1405.783
                               - 12-01-1960                                         
      Covariance Type:                  opg                                         
      ==============================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
      ------------------------------------------------------------------------------
      const        281.4780     57.109      4.929      0.000     169.547     393.410
      ar.L1          0.9365      0.028     33.290      0.000       0.881       0.992
      ma.L1          0.4261      0.076      5.595      0.000       0.277       0.575
      sigma2       974.9441    114.826      8.491      0.000     749.890    1199.999
      ===================================================================================
      Ljung-Box (L1) (Q):                   0.05   Jarque-Bera (JB):                 1.84
      Prob(Q):                              0.82   Prob(JB):                         0.40
      Heteroskedasticity (H):               6.69   Skew:                             0.27
      Prob(H) (two-sided):                  0.00   Kurtosis:                         3.16
      ===================================================================================
      
      Warnings:
      [1] Covariance matrix calculated using the outer product of gradients (complex-step).
      
      [44]:
      #Step 9: Model Validation
      [45]:
      #Split the data into training and testing sets, fit the model on the training data, and validate its performance on the testing data.
      [48]:
      train_size = int(len(data) * 0.8)
      train, test = data.iloc[:train_size], data.iloc[train_size:]
      ​
      model = ARIMA(train['Passengers'], order=(1, 0, 1))
      result = model.fit()
      ​
      # Forecast on the test set
      forecast_values = result.forecast(steps=len(test))
      ​
      # Calculate the Mean Absolute Error (MAE) for validation
      mae = np.mean(np.abs(forecast_values - test['Passengers']))
      print(f'Mean Absolute Error (MAE): {mae:.2f}')
      C:\ProgramData\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
        self._init_dates(dates, freq)
      
      Mean Absolute Error (MAE): 105.61
      
      [49]:
      #Step 10: Forecasting
      [50]:
      #Once you have a well-fitted model, you can use it to forecast future values.
      [59]:
      # Re-fit the model on the entire dataset
      model = ARIMA(data['Passengers'], order=(1, 0, 1))
      result = model.fit()
      ​
      # Forecast future values
      forecast_steps = 12  # For example, forecast 12 steps into the future
      forecast_values = result.forecast(steps=forecast_steps)
      ​
      # Plot the original data and the forecasted values
      plt.figure(figsize=(10, 6))
      plt.plot(data.index, data['Passengers'], label='Original Data')
plt.plot(pd.date_range(data.index[-1], periods=forecast_steps + 1, inclusive='right'), forecast_values, label='Forecasted Passengers', color='orange')
      plt.xlabel('Date')
      plt.ylabel('Passengers')
      plt.title('Original Data and Forecasted Passengers')
      plt.legend()
      plt.show()
      C:\ProgramData\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:471: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
        self._init_dates(dates, freq)
      
      [61]:
print("Congratulations! You have completed an in-depth time series analysis. You've learned how to load and visualize time series data, check for stationarity, detrend the data, analyze seasonality, smooth the data, perform autocorrelation and partial autocorrelation analysis, choose and fit an ARIMA model, validate the model, and make future forecasts. Time series analysis is a powerful tool for understanding and forecasting data with temporal patterns. However, keep in mind that this tutorial only scratches the surface of the vast field of time series analysis. There are many more advanced techniques and models to explore, such as seasonal ARIMA (SARIMA), state-space models, and machine learning approaches like LSTM (Long Short-Term Memory) networks. Remember that the effectiveness of time series analysis depends on the quality of your data, the appropriateness of the selected model, and the accuracy of the forecasting assumptions. Always validate your results, and consider the context and domain knowledge when interpreting the outcomes. As you continue to explore time series analysis, work on different datasets, experiment with various models and techniques, and stay updated with the latest advancements in the field. Happy analyzing!")

      [2]:
#Let's work through an example with imbalanced data. In this scenario, we'll use the "Credit Card Fraud Detection" dataset from Kaggle. This dataset contains transactions made by credit cards, and the goal is to detect fraudulent transactions, which are typically a very small proportion of the total transactions, making the data imbalanced.
      [20]:
      import numpy as np
      import pandas as pd
      from sklearn.model_selection import train_test_split
      from sklearn.preprocessing import StandardScaler
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
      ​
      # Load the credit card fraud dataset
      data = pd.read_csv(r"C:\Users\user\Desktop\creditcard.csv")
      ​
      # Explore the dataset
      print(data['Class'].value_counts())
      print(data['Class'].head())
#We have 284,315 samples of class 0 and 492 of class 1, which shows that the data is imbalanced. Note: we choose the 'Class' column as our target variable.
      0    284315
      1       492
      Name: Class, dtype: int64
      0    0
      1    0
      2    0
      3    0
      4    0
      Name: Class, dtype: int64
      
      [6]:
       
      #Step 2: Prepare the data for modeling.
      [21]:
      # Separate features and target variable
X = data.drop('Class', axis=1)  #drop the 'Class' column from the feature matrix
y = data['Class']  #choose 'Class' as the target variable
      ​
      # Split the data into training and testing sets
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      ​
      # Standardize the features
      scaler = StandardScaler()
      X_train = scaler.fit_transform(X_train)
      X_test = scaler.transform(X_test)
      [10]:
       
      #Step 3: Choose a machine learning algorithm (Logistic Regression) and train the model.
      [6]:
       
      # Create and train a Logistic Regression classifier
      logistic_model = LogisticRegression(random_state=42)
      logistic_model.fit(X_train, y_train)
      [6]:
      LogisticRegression(random_state=42)
      [12]:
       
      #Step 4: Make predictions on the test data and evaluate the model.
      [13]:
       
      # Make predictions on the test data
      y_pred = logistic_model.predict(X_test)
      ​
      # Evaluate the model
      accuracy = accuracy_score(y_test, y_pred)
      print("Accuracy:", accuracy)
      ​
      # Print the classification report for more detailed evaluation
      print(classification_report(y_test, y_pred))
      ​
      # Confusion Matrix
      conf_matrix = confusion_matrix(y_test, y_pred)
      print("Confusion Matrix:")
      print(conf_matrix)
      Accuracy: 0.9991222218320986
                    precision    recall  f1-score   support
      
                 0       1.00      1.00      1.00     56864
                 1       0.86      0.58      0.70        98
      
          accuracy                           1.00     56962
         macro avg       0.93      0.79      0.85     56962
      weighted avg       1.00      1.00      1.00     56962
      
      Confusion Matrix:
      [[56855     9]
       [   41    57]]
      
      [14]:
      #Step 5: Handle imbalanced data using techniques like "class weights" or "resampling."
      [16]:
      # Option 1: Using class weights
      logistic_model_weighted = LogisticRegression(class_weight='balanced', random_state=42)
      logistic_model_weighted.fit(X_train, y_train)
      ​
      # Option 2: Using resampling techniques like SMOTE (Synthetic Minority Over-sampling Technique)
      from imblearn.over_sampling import SMOTE
      ​
      smote = SMOTE(random_state=42)
      X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
      ​
      logistic_model_resampled = LogisticRegression(random_state=42)
      logistic_model_resampled.fit(X_train_resampled, y_train_resampled)
      [16]:
      LogisticRegression(random_state=42)
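SMOTE generates synthetic minority samples by interpolating between neighbors; the simplest baseline it improves on is plain random oversampling, which just duplicates minority rows until the classes balance. A self-contained sketch of that baseline with NumPy only (no imblearn needed), on a toy imbalanced set:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced data: 10 majority samples (class 0), 2 minority (class 1).
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0] * 10 + [1] * 2)

# Randomly resample minority rows (with replacement) up to the majority count.
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=(y == 0).sum() - minority_idx.size, replace=True)

X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])

print(np.bincount(y_res))  # classes are now balanced
```

Unlike SMOTE, duplicated rows add no new information, which is why SMOTE's interpolated samples often generalize better.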
      [19]:
print("By handling the imbalanced data, you improve the model's performance in detecting the minority class (fraudulent transactions). Techniques like using class weights or resampling can help mitigate the impact of imbalanced data on the model's training and lead to more accurate predictions for the minority class. Keep in mind that imbalanced data is a common challenge in machine learning, and there are various other techniques and algorithms designed to address this issue, such as using different evaluation metrics (e.g., ROC-AUC, precision-recall curves) or employing ensemble methods like Random Forest and Gradient Boosting, which are often robust to imbalanced data.")
      
      [20]:
      #Step 6: Evaluate the models with imbalanced data handling.
      [21]:
      # Option 1: Using class weights
      y_pred_weighted = logistic_model_weighted.predict(X_test)
      ​
      print("Model with Class Weights:")
      accuracy_weighted = accuracy_score(y_test, y_pred_weighted)
      print("Accuracy:", accuracy_weighted)
      print(classification_report(y_test, y_pred_weighted))
      conf_matrix_weighted = confusion_matrix(y_test, y_pred_weighted)
      print("Confusion Matrix:")
      print(conf_matrix_weighted)
      ​
      # Option 2: Using resampling with SMOTE
      y_pred_resampled = logistic_model_resampled.predict(X_test)
      ​
      print("Model with Resampling (SMOTE):")
      accuracy_resampled = accuracy_score(y_test, y_pred_resampled)
      print("Accuracy:", accuracy_resampled)
      print(classification_report(y_test, y_pred_resampled))
      conf_matrix_resampled = confusion_matrix(y_test, y_pred_resampled)
      print("Confusion Matrix:")
      print(conf_matrix_resampled)
      Model with Class Weights:
      Accuracy: 0.9763702117200941
                    precision    recall  f1-score   support
      
                 0       1.00      0.98      0.99     56864
                 1       0.06      0.92      0.12        98
      
          accuracy                           0.98     56962
         macro avg       0.53      0.95      0.55     56962
      weighted avg       1.00      0.98      0.99     56962
      
      Confusion Matrix:
      [[55526  1338]
       [    8    90]]
      Model with Resampling (SMOTE):
      Accuracy: 0.9745970998209332
                    precision    recall  f1-score   support
      
                 0       1.00      0.97      0.99     56864
                 1       0.06      0.92      0.11        98
      
          accuracy                           0.97     56962
         macro avg       0.53      0.95      0.55     56962
      weighted avg       1.00      0.97      0.99     56962
      
      Confusion Matrix:
      [[55425  1439]
       [    8    90]]
      
      [22]:
      #By evaluating the models using different techniques to handle imbalanced data, you should notice the improvements in performance for detecting the minority class (fraudulent transactions). The class weights help the model give more importance to the minority class during training, and the SMOTE technique generates synthetic samples to balance the dataset, making the model better at identifying the minority class.
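The precision/recall trade-off in the reports above can be verified by hand from the confusion matrix of the class-weighted model ([[55526, 1338], [8, 90]]): precision = TP/(TP+FP) and recall = TP/(TP+FN).

```python
# Entries taken from the class-weighted model's confusion matrix above.
tn, fp, fn, tp = 55526, 1338, 8, 90

precision = tp / (tp + fp)  # of transactions flagged as fraud, how many were real fraud
recall = tp / (tp + fn)     # of real frauds, how many were caught

print(f"precision={precision:.2f} recall={recall:.2f}")
```

This reproduces the 0.06 precision / 0.92 recall reported for class 1: the weighted model catches almost all frauds but at the cost of many false alarms.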
      [23]:
      #Keep in mind that the choice between class weights and resampling techniques may vary based on the specific dataset and the characteristics of the problem. Experimenting with different approaches and evaluating their performance is essential in addressing the challenges posed by imbalanced data.
      [24]:
      #In practice, you might also want to consider other techniques such as ensemble methods (e.g., RandomForest, Gradient Boosting) with imbalanced data handling to further improve the model's performance. Additionally, feature engineering, hyperparameter tuning, and feature selection are also crucial aspects of the machine learning process that can impact the model's effectiveness.
      [27]:
       
      import numpy as np
      import pandas as pd
      from sklearn.datasets import load_wine
      from pandas.plotting import scatter_matrix
      import matplotlib.pyplot as plt
      [14]:
       
      data = load_wine()
      wine = pd.DataFrame(data.data, columns=data.feature_names)
      [15]:
       
      print(wine.shape)
      (178, 13)
      
      [16]:
       
      print(wine.columns)
      Index(['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
             'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
             'proanthocyanins', 'color_intensity', 'hue',
             'od280/od315_of_diluted_wines', 'proline'],
            dtype='object')
      
      [17]:
       
      print(wine.iloc[:, :3].describe())
                alcohol  malic_acid         ash
      count  178.000000  178.000000  178.000000
      mean    13.000618    2.336348    2.366517
      std      0.811827    1.117146    0.274344
      min     11.030000    0.740000    1.360000
      25%     12.362500    1.602500    2.210000
      50%     13.050000    1.865000    2.360000
      75%     13.677500    3.082500    2.557500
      max     14.830000    5.800000    3.230000
      
      [18]:
       
      print(wine.describe())
                alcohol  malic_acid         ash  alcalinity_of_ash   magnesium  \
      count  178.000000  178.000000  178.000000         178.000000  178.000000   
      mean    13.000618    2.336348    2.366517          19.494944   99.741573   
      std      0.811827    1.117146    0.274344           3.339564   14.282484   
      min     11.030000    0.740000    1.360000          10.600000   70.000000   
      25%     12.362500    1.602500    2.210000          17.200000   88.000000   
      50%     13.050000    1.865000    2.360000          19.500000   98.000000   
      75%     13.677500    3.082500    2.557500          21.500000  107.000000   
      max     14.830000    5.800000    3.230000          30.000000  162.000000   
      
             total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  \
      count     178.000000  178.000000            178.000000       178.000000   
      mean        2.295112    2.029270              0.361854         1.590899   
      std         0.625851    0.998859              0.124453         0.572359   
      min         0.980000    0.340000              0.130000         0.410000   
      25%         1.742500    1.205000              0.270000         1.250000   
      50%         2.355000    2.135000              0.340000         1.555000   
      75%         2.800000    2.875000              0.437500         1.950000   
      max         3.880000    5.080000              0.660000         3.580000   
      
             color_intensity         hue  od280/od315_of_diluted_wines      proline  
      count       178.000000  178.000000                    178.000000   178.000000  
      mean          5.058090    0.957449                      2.611685   746.893258  
      std           2.318286    0.228572                      0.709990   314.907474  
      min           1.280000    0.480000                      1.270000   278.000000  
      25%           3.220000    0.782500                      1.937500   500.500000  
      50%           4.690000    0.965000                      2.780000   673.500000  
      75%           6.200000    1.120000                      3.170000   985.000000  
      max          13.000000    1.710000                      4.000000  1680.000000  
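The summary statistics above show features on wildly different scales (proline runs into the hundreds while hue stays near 1), which would dominate any distance-based model applied later. A minimal sketch of a standardisation step using scikit-learn's `StandardScaler` (an addition for illustration, not part of this notebook):

```python
# Put all features on a comparable scale (mean 0, std 1).
# This step is illustrative and not part of the original notebook.
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

data = load_wine()
wine = pd.DataFrame(data.data, columns=data.feature_names)

scaled = pd.DataFrame(StandardScaler().fit_transform(wine),
                      columns=wine.columns)

# After scaling, proline and hue have the same mean and spread.
print(scaled.describe().loc[['mean', 'std'], ['proline', 'hue']])
```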
      
      [26]:
       
target = data.target
target_names = data.target_names
# map each sample's class (0, 1, 2) to its own plotting colour,
# so the scatter panels show the three cultivars in different colours
palette = np.array(['red', 'green', 'blue'])
colors = palette[target]
scatter_matrix(wine, alpha=0.8, figsize=(12, 12), diagonal='hist', color=colors)
#plt.savefig("plot.png")
# note: plt.legend() has no per-class handles to pick up on a
# scatter_matrix, so no legend is drawn here
plt.show()
      [22]:
       
      scatter_matrix(wine.iloc[:,:5])
      plt.show()